FastQC is a set of tests that can be used to assess the quality of RNA-seq data. At each step of an RNA-seq pipeline, the number of low-quality reads is reduced:

  1. __Trimming of raw data:__ This step involves filtering the raw reads based on a cutoff quality score (reads with quality > 20 are retained) and length (reads with length > 35 bp are retained). The adapter sequences are also removed in this step.
  2. Removal of rRNA sequences: This step involves the removal of rRNA sequences. This is a problem with this particular dataset, as several libraries have significant levels of rRNA sequences, likely due to inadequate rRNA-removal during library preparation. If this step is not completed, the downstream differential gene expression analysis might detect these rRNA sequences as being differentially expressed, even when this is not true biologically.
  3. Alignment of rRNA-removed sequences to reference zebrafish genome: Reads that do not align are removed.
  4. Deduplication of aligned reads: Reads that do not uniquely align, or appear to be optical or PCR duplicates, are removed.
  5. Removal of DNA contamination sequences: Reads that appear to be DNA contamination are removed.

Setting Up

At each step of the RNA-seq pipeline, FastQC was run to assess the quality of the reads. First, we extract the FastQC information from the FastQC reports located at /data/biohub/ on Phoenix.

dataDir <- "/Volumes/biohub/2017_Lardelli_K97Gfs"
fastQC_0_raw <- list.files(file.path(dataDir, "0_rawData", "fastqc"), pattern = ".zip", full.names = TRUE) %>% getFastqcData()
## Warning: package 'bindrcpp' was built under R version 3.3.2
fastQC_1_trim <-list.files(file.path(dataDir, "1_trimmedData", "fastqc"), pattern = ".zip", full.names = TRUE) %>% getFastqcData()
fastQC_2_rRNArem <- list.files(file.path(dataDir, "2_rRNAremovedData", "fastqc"), pattern = ".zip", full.names = TRUE) %>% getFastqcData()
fastQC_3_aligned <- list.files(file.path(dataDir, "3_alignedData", "fastqc"), pattern = ".zip", full.names = TRUE) %>% getFastqcData()
fastQC_4_dedup <- list.files(file.path(dataDir, "4_dedupData", "fastqc"), pattern = ".zip", full.names = TRUE) %>% getFastqcData()
fastQC_5_clean <- list.files(file.path(dataDir, "5_cleanRData", "fastqc"), pattern = ".zip", full.names = TRUE) %>% getFastqcData()
fastQC_all <- as.list(fastQC_0_raw, fastQC_1_trim, fastQC_2_rRNArem, fastQC_3_aligned, fastQC_4_dedup, fastQC_5_clean)

Summary Plots

Summary plots can be produced to summarise the PASS, FAIL or WARNING statuses for each of the FastQC tests for each RNA-seq library:

fastqcReports::plotSummary(fastQC_0_raw)

fastqcReports::plotSummary(fastQC_1_trim)

fastqcReports::plotSummary(fastQC_2_rRNArem)

fastqcReports::plotSummary(fastQC_3_aligned)

fastqcReports::plotSummary(fastQC_4_dedup)

fastqcReports::plotSummary(fastQC_5_clean)

Important Points

RAW DATA

  • Several RNA-seq libraries have WARNING statuses for their Per base sequence quality.
  • Several RNA-seq libraries have WARNING statuses for their Per sequence GC content.
  • Several RNA-seq libraries have WARNING statuses for their Sequence Length Distribution.
  • All RNA-seq libraries have WARNING statuses for their Sequence Duplication Levels and three of these libraries, corresponding to read 1 in library 12, and reads 1 and 2 for library 6. Interestingly, library 6 also has a significantly larger number of reads compared to the other libraries.
  • All RNA-seq libraries have WARNING statuses for their Overrepresented sequences.
  • Several RNA-seq libraries have FAIL statuses for their Adapter Content, indicating that these still need to be removed.
  • All RNA-seq libraries have FAIL statuses for their Kmer Content.

TRIMMED DATA

  • Trimming data has changed the result of Per base sequence quality to PASS for all RNA-seq libraries. This is as expected, as the trimming step also includes a quality-filtering step to filter out all reads with quality score < 20.
  • Trimming data has successfully removed the adapter sequences in the raw RNA-seq libraries.
  • Trimming data has apparently changed the Sequence Duplication Levels status of library 12, read 1 from FAIL to WARNING.
  • The per base sequence content has a FAIL status with all samples.
  • GC content is unaffected by trimming.
  • Trimming data has resulted in several PASS statuses for Sequence Length Distribution changing into WARNING statuses.

Read Totals

The number of reads lost/discarded at each step are shown in the following table.

RNAseq_Library Raw Trimmed rRNA_removed Aligned Deduplicated DNA_removed
1_non_mutant_K97Gfs_24mth_13_03_2014_S1_fem_R1 54,955,454 54,893,882 54,893,790 50,933,145 38,308,550 36,879,228
10_psen1K97Gfshet_6mth_10_03_2016_S1_fem_R1 78,164,332 78,117,636 78,117,166 72,377,056 54,263,688 52,047,915
11_psen1K97Gfshet_6mth_10_03_2016_S2_fem_R1 88,387,480 88,341,182 88,340,660 81,314,599 61,911,413 59,562,064
12_psen1K97Gfshet_6mth_10_03_2016_S3_fem_R1 76,569,314 76,532,984 76,532,406 71,789,420 53,280,588 51,351,547
2_non_mutant_K97Gfs_24mth_13_03_2014_S2_fem_R1 69,810,372 69,742,212 69,742,048 64,891,457 48,554,022 46,752,384
3_non_mutant_K97Gfs_24mth_13_03_2014_S3_fem_R1 57,292,450 57,227,560 57,227,430 53,481,065 39,742,298 38,238,799
4_non_mutant_K97Gfs_6mth_10_03_2016_S1_fem_R1 92,739,344 92,685,264 92,684,444 85,502,011 66,177,061 63,677,470
5_non_mutant_K97Gfs_6mth_10_03_2016_S2_fem_R1 78,013,106 77,935,126 77,934,538 73,657,946 56,904,193 54,654,502
6_non_mutant_K97Gfs_6mth_10_03_2016_S3_fem_R1 166,154,748 166,061,488 166,060,174 153,364,755 108,442,547 104,122,705
7_psen1K97Gfshet_24mth_13_03_2014_S1_fem_R1 54,364,124 54,299,138 54,299,006 50,269,951 38,004,152 36,585,600
8_psen1K97Gfshet_24mth_13_03_2014_S2_fem_R1 71,216,754 71,142,748 71,142,274 65,911,748 49,193,090 47,231,972
9_psen1K97Gfshet_24mth_13_03_2014_S3_fem_R1 76,720,686 76,631,824 76,631,444 71,210,915 52,995,751 51,048,580

HISAT2 Logs

hisat2logs_3_aligned <- list.files(file.path(dataDir, "3_alignedData", "info"), pattern = ".info", full.names = TRUE) %>% 
  importHisat2Logs() %>%
  extract(, c("Filename", "Total_Reads", "Unique_In_Pairs", "Not_Aligned", "Alignment_Rate"))

fastqcReports::readTotals(fastQC_0_raw) %>% View
fastqcReports::plotSequenceQualities(fastQC_1_trim)  + guides(colour = FALSE)

fastqcReports::plotSequenceQualities(fastQC_2_rRNArem)  + guides(colour = FALSE)

fastqcReports::plotSequenceQualities(fastQC_3_aligned)  + guides(colour = FALSE)

fastqcReports::plotSequenceQualities(fastQC_4_dedup)  + guides(colour = FALSE)

fastqcReports::plotSequenceQualities(fastQC_5_clean)  + guides(colour = FALSE)

Per Base Sequence Content

fastqcReports::plotSequenceQualities(fastQC_0_raw)  + guides(colour = FALSE)

fastqcReports::plotSequenceQualities(fastQC_1_trim)  + guides(colour = FALSE)

fastqcReports::plotSequenceQualities(fastQC_2_rRNArem)  + guides(colour = FALSE)

fastqcReports::plotSequenceQualities(fastQC_3_aligned)  + guides(colour = FALSE)

fastqcReports::plotSequenceQualities(fastQC_4_dedup)  + guides(colour = FALSE)

fastqcReports::plotSequenceQualities(fastQC_5_clean)  + guides(colour = FALSE)

GC Content

fastqcReports::plotGcContent(fastQC_0_raw) + guides(colour = FALSE)

fastqcReports::plotGcContent(fastQC_1_trim) + guides(colour = FALSE)

fastqcReports::plotGcContent(fastQC_2_rRNArem) + guides(colour = FALSE)

fastqcReports::plotGcContent(fastQC_3_aligned) + guides(colour = FALSE)

fastqcReports::plotGcContent(fastQC_4_dedup) + guides(colour = FALSE)

fastqcReports::plotGcContent(fastQC_5_clean) + guides(colour = FALSE)